518 research outputs found

    Revisiting Pre-Trained Models for Chinese Natural Language Processing

    Full text link
    Bidirectional Encoder Representations from Transformers (BERT) has shown marvelous improvements across various NLP tasks, and consecutive variants have been proposed to further improve the performance of the pre-trained language models. In this paper, we target on revisiting Chinese pre-trained language models to examine their effectiveness in a non-English language and release the Chinese pre-trained language model series to the community. We also propose a simple but effective model called MacBERT, which improves upon RoBERTa in several ways, especially the masking strategy that adopts MLM as correction (Mac). We carried out extensive experiments on eight Chinese NLP tasks to revisit the existing pre-trained language models as well as the proposed MacBERT. Experimental results show that MacBERT could achieve state-of-the-art performances on many NLP tasks, and we also ablate details with several findings that may help future research. Resources available: https://github.com/ymcui/MacBERTComment: 12 pages, to appear at Findings of EMNLP 202

    Transcribing Content from Structural Images with Spotlight Mechanism

    Full text link
    Transcribing content from structural images, e.g., writing notes from music scores, is a challenging task as not only the content objects should be recognized, but the internal structure should also be preserved. Existing image recognition methods mainly work on images with simple content (e.g., text lines with characters), but are not capable to identify ones with more complex content (e.g., structured symbols), which often follow a fine-grained grammar. To this end, in this paper, we propose a hierarchical Spotlight Transcribing Network (STN) framework followed by a two-stage "where-to-what" solution. Specifically, we first decide "where-to-look" through a novel spotlight mechanism to focus on different areas of the original image following its structure. Then, we decide "what-to-write" by developing a GRU based network with the spotlight areas for transcribing the content accordingly. Moreover, we propose two implementations on the basis of STN, i.e., STNM and STNR, where the spotlight movement follows the Markov property and Recurrent modeling, respectively. We also design a reinforcement method to refine the framework by self-improving the spotlight mechanism. We conduct extensive experiments on many structural image datasets, where the results clearly demonstrate the effectiveness of STN framework.Comment: Accepted by KDD2018 Research Track. In proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'18

    Simultaneous profiling of transcriptome and DNA methylome from a single cell.

    Get PDF
    BackgroundSingle-cell transcriptome and single-cell methylome technologies have become powerful tools to study RNA and DNA methylation profiles of single cells at a genome-wide scale. A major challenge has been to understand the direct correlation of DNA methylation and gene expression within single-cells. Due to large cell-to-cell variability and the lack of direct measurements of transcriptome and methylome of the same cell, the association is still unclear.ResultsHere, we describe a novel method (scMT-seq) that simultaneously profiles both DNA methylome and transcriptome from the same cell. In sensory neurons, we consistently identify transcriptome and methylome heterogeneity among single cells but the majority of the expression variance is not explained by proximal promoter methylation, with the exception of genes that do not contain CpG islands. By contrast, gene body methylation is positively associated with gene expression for only those genes that contain a CpG island promoter. Furthermore, using single nucleotide polymorphism patterns from our hybrid mouse model, we also find positive correlation of allelic gene body methylation with allelic expression.ConclusionsOur method can be used to detect transcriptome, methylome, and single nucleotide polymorphism information within single cells to dissect the mechanisms of epigenetic gene regulation

    A Span-Extraction Dataset for Chinese Machine Reading Comprehension

    Full text link
    Machine Reading Comprehension (MRC) has become enormously popular recently and has attracted a lot of attention. However, the existing reading comprehension datasets are mostly in English. In this paper, we introduce a Span-Extraction dataset for Chinese machine reading comprehension to add language diversities in this area. The dataset is composed by near 20,000 real questions annotated on Wikipedia paragraphs by human experts. We also annotated a challenge set which contains the questions that need comprehensive understanding and multi-sentence inference throughout the context. We present several baseline systems as well as anonymous submissions for demonstrating the difficulties in this dataset. With the release of the dataset, we hosted the Second Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2018). We hope the release of the dataset could further accelerate the Chinese machine reading comprehension research. Resources are available: https://github.com/ymcui/cmrc2018Comment: 6 pages, accepted as a conference paper at EMNLP-IJCNLP 2019 (short paper
    corecore